0.1 Data

The UN data set available in R is used for this analysis. It provides information on national health, welfare and population statistics for 213 locations, mainly UN members. Different hypothesis are tested to gain insights into the differences in the groups and regions.

fertility: total fertility rate, number of children per woman

ppgdp: Gross domestic product per capita in US dollars

lifeExpF: female life expectancy in years

pctUrban: urban population in %

infantMortality: Infant mortality at the age of 1 year per 1000 live births

groups: OECD, Africa, Other

region: Africa, Asia, Caribbean, Europe, Latin Amer, North America, NorthAtlantic, Oceania

0.2 Loading/cleaning the data and descriptive statistics

data(UN)
head(UN)
#omit NAs
anyNA(UN)
#TRUE
UN_data <- na.omit(UN)
#summary
summary(UN_data)
##         region      group       fertility         ppgdp         
##  Africa    :52   oecd  : 31   Min.   :1.134   Min.   :   114.8  
##  Asia      :50   other :110   1st Qu.:1.750   1st Qu.:  1239.8  
##  Europe    :39   africa: 52   Median :2.264   Median :  4495.8  
##  Latin Amer:20                Mean   :2.780   Mean   : 12291.1  
##  Oceania   :17                3rd Qu.:3.700   3rd Qu.: 14497.3  
##  Caribbean :13                Max.   :6.925   Max.   :105095.4  
##  (Other)   : 2                                                  
##     lifeExpF        pctUrban     infantMortality  
##  Min.   :48.11   Min.   : 11.0   Min.   :  1.916  
##  1st Qu.:65.10   1st Qu.: 39.0   1st Qu.:  7.243  
##  Median :75.57   Median : 59.0   Median : 19.637  
##  Mean   :72.08   Mean   : 57.1   Mean   : 30.739  
##  3rd Qu.:79.07   3rd Qu.: 75.0   3rd Qu.: 45.892  
##  Max.   :87.12   Max.   :100.0   Max.   :124.535  
## 

Scatter plots for a basic understanding of the data

pairs(UN_data[, c("fertility", "ppgdp", "lifeExpF", "pctUrban", "infantMortality")],
      lower.panel = NULL) 

0.3 Hypothesis

0.3.1 H1 and H2

H1: There are significant differences between Africa and other country groups in terms of life expectancy and fertility rate.

H2: Life expectancy of women in African countries is significantly lower than in other country groups and in OECD countries.

0.3.1.1 mean and sd by group

mean_overview_group <- UN_data %>%
  select(group, fertility, ppgdp, lifeExpF, pctUrban, infantMortality) %>%
  group_by(group) %>%
  mutate(N = n()) %>%  
  summarise_all(mean)

print(mean_overview_group)
## # A tibble: 3 Ă— 7
##   group  fertility  ppgdp lifeExpF pctUrban infantMortality     N
##   <fct>      <dbl>  <dbl>    <dbl>    <dbl>           <dbl> <dbl>
## 1 oecd        1.77 37761.     82.4     75.8            4.89    31
## 2 other       2.36  9819.     75.1     58.8           21.7    110
## 3 africa      4.27  2337.     59.4     42.4           65.3     52
sd_overview_group <- UN_data %>%
  select(group, fertility, ppgdp, lifeExpF, pctUrban, infantMortality) %>%
  group_by(group) %>%
  summarise_all(sd)

print(sd_overview_group)
## # A tibble: 3 Ă— 6
##   group  fertility  ppgdp lifeExpF pctUrban infantMortality
##   <fct>      <dbl>  <dbl>    <dbl>    <dbl>           <dbl>
## 1 oecd       0.340 22092.     2.09     11.7            3.50
## 2 other      0.941 12502.     5.66     23.4           17.5 
## 3 africa     1.29   3424.     8.39     17.7           27.5

0.3.1.2 Visualizations

Examinnation of the relationships between the variables using visualizations.

hist(UN_data$lifeExpF, main = "", xlab = "Life expectancy of women in years")

hist(UN_data$fertility, main = "", xlab = "Fertility rate, number of children per woman")

ggplot(UN_data, aes(x = group, y = lifeExpF, color = group)) +
  geom_boxplot() +
  labs(
    x = "Group",
    y = "Life Expectancy of women (years)"
  )+
   scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
  theme_minimal()+
  guides(color = FALSE)

ggplot(UN_data, aes(x = group, y = fertility, color = group)) +
  geom_boxplot() +
  labs(
    x = "Group",
    y = "Fertility rate (number of children per woman)"
  ) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
  theme_minimal()+
  guides(color = FALSE)

color <- c("#00AFBB", "#E7B800", "#FC4E07")
fig <- plot_ly(UN_data, x = ~fertility, y = ~lifeExpF,
               size = ~fertility, sizes = c(10, 1000), color = ~group, colors = color)
fig <- fig %>% layout(xaxis = list(title = "Fertility"),
                      yaxis = list(title = "Life expectancy"))
fig

0.3.1.3 Hypothesis testing

Checking the prerequisites for an ANOVA

  • Normal distribution of the residuals

  • Homogeneity of variance (Homoscedasticity)

Testing the residuals for normal distribution

The QQ plot compares the empirical quantiles of the residuals with the theoretical quantiles of a normal distribution. A good fit is indicated by an approximately straight-line relationship between the empirical and theoretical quantiles.

#linear mode
model_fertility <- lm(fertility ~ group, data = UN_data)
#Q-Q-Plot
ggqqplot(residuals(model_fertility))

#model
model_lifeExpF <- lm(lifeExpF ~ group, data = UN_data)
#Q-Q-Plot
ggqqplot(residuals(model_lifeExpF))

The residuals of both variables do not seem to be clearly normally distributed. As a parametric method, one-way ANOVA provides the best interpretable results if the dependent variable is approximately normally distributed in each group.

Testing for homoscedasticity

leveneTest(data= UN_data, lifeExpF ~ group)

# p=1.516e-06 < 0.05 there is no homogeneity of variance

leveneTest(data= UN_data, fertility ~ group)

# p= 4.358e-06 <0.05 there is no homogeneity of variance

Variance homogeneity is an important prerequisite for ANOVA, as it ensures that the variance of the dependent variables are approximately the same across the groups. If this assumption is violated, the ANOVA results may be unreliable and lead to erroneous conclusions.

plot(model_fertility, 1)

plot(model_lifeExpF, 1)

The Residuals vs. Fitted diagram shows the residuals against the predicted values. A random pattern without systematic trends indicates homoscedasticity. The data points in the Residuals vs. Fitted Diagram in both cases almost form a triangle shape, it indicates heteroscedasticity, which means that the variance of the residuals is not constant across the range of fitted values.

Since the requirements for a one-way ANOVA are not sufficiently met, the hypothesis will be tested with the non-parametric Kruskal-Wallis-Test instead.

Kruskal-Wallis-Test

kruskal.test(fertility ~ group, data = UN_data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  fertility by group
## Kruskal-Wallis chi-squared = 83.377, df = 2, p-value < 2.2e-16
kruskal.test(lifeExpF ~ group, data = UN_data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  lifeExpF by group
## Kruskal-Wallis chi-squared = 115.11, df = 2, p-value < 2.2e-16

The test result includes the chi-square value, the degree of freedom and the p-value. A significant p-value indicates that at least one group differs significantly from the other groups.

  • Based on p-value < 2.2e-16 <0.05, it can be concluded that there is a statistically significant difference between the birth rates by country group.

  • Based on p-value < 2.2e-16 <0.05, it can be concluded that there is a statistically significant difference between life expectancy by country group.

post-hoc pairwise comparisons

dunn.test(UN_data$fertility, UN_data$group, method = "bonferroni")
##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 83.3766, df = 2, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                  (Bonferroni)                                  
## Col Mean-|
## Row Mean |     africa       oecd
## ---------+----------------------
##     oecd |   8.369592
##          |    0.0000*
##          |
##    other |   7.368734  -3.241147
##          |    0.0000*    0.0018*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2
dunn.test(UN_data$lifeExpF, UN_data$group, method = "bonferroni")
##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 115.1114, df = 2, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                  (Bonferroni)                                  
## Col Mean-|
## Row Mean |     africa       oecd
## ---------+----------------------
##     oecd |  -10.44108
##          |    0.0000*
##          |
##    other |  -7.390998   5.534292
##          |    0.0000*    0.0000*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

If the p-value for a pairwise comparison is smaller than the significance level divided by the number of comparisons (alpha/2 in this case), then the null hypothesis will be rejected. The null hypothesis states that there is no difference between the groups being compared. The Bonferroni correction is used to reduce the probability of wrong results. All pairwise comparisons indicate significant differences between African countries and OECD countries and others in terms of fertility rate and life expectancy. Furthermore,the negative values in the table indicate that the mean life expectancy of women in African countries is lower than in the OECD and other countries.

0.3.2 H3

H3: There are significant differences between the regions Asia and Europe in terms of the urbanisation rates.

0.3.2.1 mean and sd by region

mean_overview_region <- UN_data %>%
  select(region, pctUrban) %>%
  group_by(region) %>%
  mutate(N = n()) %>% 
  summarise_all(mean)

print(mean_overview_region)
## # A tibble: 7 Ă— 3
##   region        pctUrban     N
##   <fct>            <dbl> <dbl>
## 1 Africa            42.4    52
## 2 Asia              57.4    50
## 3 Caribbean         57.8    13
## 4 Europe            70.5    39
## 5 Latin Amer        70      20
## 6 North America     82       2
## 7 Oceania           52.2    17
sd_overview_region <- UN_data %>%
  select(region, pctUrban) %>%
  group_by(region) %>%
  summarise_all(sd)

print(sd_overview_region)
## # A tibble: 7 Ă— 2
##   region        pctUrban
##   <fct>            <dbl>
## 1 Africa           17.7 
## 2 Asia             25.1 
## 3 Caribbean        25.1 
## 4 Europe           13.1 
## 5 Latin Amer       16.9 
## 6 North America     1.41
## 7 Oceania          28.3

0.3.2.2 Visualization

Examinnation of the relationships between the variables using a boxplot.

ggplot(UN_data, aes(x = region, y = pctUrban, color = region)) +
  geom_boxplot() +
  labs(
    x = "Region",
    y = "Urbanisation rate in %"
  )+
   scale_color_manual(values = c("#c0c0c0", "#FC4E07", "#c0c0c0", "#E7B800", "#c0c0c0", "#c0c0c0", "#c0c0c0"))+
  theme_minimal()+
  guides(color = FALSE)

0.3.2.3 Hypothesis testing

Checking the prerequisites for an ANOVA

  • Normal distribution of the residuals

  • Homogeneity of variance (Homoscedasticity)

Testing the residuals for normal distribution

model_urban <- aov(pctUrban ~ region, data = UN_data)

#Q-Q-Plot
ggqqplot(resid(model_urban))

#shapiro-Wilk-Test
residuals_model <- shapiro.test(resid(model_urban))
print(residuals_model)
#p-value = 0.2188 >0.05 indicates normal distribution

The residuals of the dependent variable seems to be approximately normally distributed.

Testing for Homoscedasticity

leveneTest(data = UN_data, pctUrban ~ region)

# p=1.378e-05 < 0.05 Homogeneity of variance is not assumed
plot(model_urban, 1)

Variance homogeneity can be assumed (even though the levene test had a different result) because the data points are relatively evenly distributed. Since the requirements for a one-way ANOVA, normal distribution of the residuals and variance homogeneity are met, the one-way ANOVA will be used to test the hypothesis.

one-way ANOVA

anova(model_urban)

The p-value 1.222e-08 indicates that there are significant differences between the regions, which means that at least one of the regions differs statistically significantly from the others.

post-hoc pairwise comparisons

UN_data %>%
  pairwise_t_test(pctUrban~region,
                  pool.sd = TRUE,
                  p.adjust.method = "bonferroni") %>%
  as.data.frame()

Calculating the effect size How big are the differences between Africa and Asia in terms of the urbanisation rate?

subset_data <- subset(UN_data, region %in% c("Africa", "Asia"))

#cohens d comparison between Africa and Asia
cohen.d(pctUrban ~ region, data = subset_data)
## 
## Cohen's d
## 
## d estimate: -0.6927299 (medium)
## 95 percent confidence interval:
##      lower      upper 
## -1.0973001 -0.2881598

Cohen’s d measures the standard deviation of the differences between the groups in relation to the mean standard deviation. A value of -0.6927299 indicates that there is a medium effect.